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Summary. Combining matching and regression for causai inference provides doubie-robustness in 
removing treatment effect estimation bias due to confounding variabies. In most reai-worid appiica- 
tions, however, treatment and controi popuiations are not iarge enough for matching to achieve perfect 
or near-perfect baiance on aii confounding variabies and their noniinear/interaction functions, ieading 
to trade-offs, [this fact is independent of regression, so a bit disjointed from first sentence.] Further¬ 
more, variance is as important of a contributor as bias towards totai error in smaii sampies, and must 
therefore be factored into the methodoiogicai decisions. In this paper, we deveiop a mathematicai 
framework for quantifying the combined impact of matching and iinear regression on bias and vari¬ 
ance of treatment effect estimation. The framework inciudes expressions for bias and variance in a 
misspecified iinear regression, theorems regarding impact of matching on bias and variance, and a 
constrained bias estimation approach for quantifying misspecification bias and combining it with vari¬ 
ance to arrive at totai error. Methodoiogicai decisions can thus be based on minimization of this totai 
error, given the practitioner’s assumption/beiief about an intuitive parameter, which we caii ‘omitted R- 
squared’. The proposed methodoiogy exciudes the outcome variabie from anaiysis, thereby avoiding 
overfit creep and making it suitabie for observationai study designs. Aii core functions for bias and 
variance caicuiation, as weii as diagnostic toois for bias-variance trade-off anaiysis, matching caiibra- 
tion, and power anaiysis are made avaiiabie to researchers and practitioners through an open-source 
R iibrary, MatchLinReg. 

Keywords: causai inference, observationai studies, iinear regression, propensity score match¬ 
ing, Mahaianobis matching, bias, variance 

1. Introduction 

Despite a rich body of literature on causal inference in observation studies, some important ques¬ 
tions remain unanswered. Motivated by the desire to produce unbiased estimation of treatment 
effect (TE) in the presence of confounding variables, theoretical work in this field has largely focused 
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on large-sample, bias-removal properties of techniques such as matching and regression (Abadie and 


Imbens, 2006, 2011). In applied settings, however, sample sizes are finite and often small (including 


control groups), asymptotic properties are not realized, and variance is as important as bias in con¬ 
tributing towards total estimation error. This is particularly true in medical applications such as 


studying the effectiveness of a novel in-patient heart procedure (Kereiakes et al., 2000), where not 
only randomized treatment assignment is unfeasible due to ethical or logistical concerns, but data 
collection is costly and time-consuming, and thus early identification of treatment effectiveness is 
highly valuable. 

In particular, the general recognition by researchers of the added benefit of combining matching 


and regression to achieve the so-called ‘double-robustness’ (Rubin, 1973; Carpenter, 1977, Rubin 


1979 

Robins and Rotnitzky, 1995 

Heckman et al. 

1997; 

Rubin and Thomas, 2000 

Glazerman 


et ah, 2003 Abadie and Imbens, 2006 Ho et al. 2007; Abadie and Imbens 2011) has yet to 


translate into a set of guidelines and tools that allow for their effective utilization by practitioners. 
For example, what caliper size should be used in matching? What is the impact of the choice of 
matching technique, e.g. Mahalanobis distance matching vs. propensity score matching (PSM), on 
TE estimation? What terms should be included in matching (e.g. in a logistic regression model 
used for generating propensity scores)? How do such decisions affect not only TE estimation bias 
but also variance? At what point is study power diminished beyond usefulness when matching is 
applied more and more restrictively to reduce bias? Simulation studies have shed light on some of 
these questions empirically (]Austin et al. 2007; Austin, 2009; Gayat et ah, 2012), and providing 


a theoretical foundation can facilitate generalizable conclusions from such empirical work (Imbens 


2004). 


In this paper, we develop the mathematical framework for quantifying the combined effect of 
matching and linear regression on TE estimation bias and variance in finite samples with arbitrary 


covariate distributions. Similar to the approach of Ho et al. (2007), we consider matching as a non- 
parametric pre-processing step prior to regression adjustment. We cast the model misspecification 
problem as a covariate omission problem, and derive closed-form expressions for TE estimation bias 
and variance in linear regression adjustment. Our expressions for bias and variance are useful in 
three capacities: 


1. Theory: Using the normalized form of bias expression, we prove that perfect matching on 
included and omitted covariates eliminates bias. Eurthermore, perfect matching on included 
covariates minimizes variance, given a fixed number of treatment and control observations. 
The variance-minimization property of matching provides theoretical support for the notion 
that study power is not lost as rapidly when data loss is due to matching, compared to data 
loss that is random. 

2. Simulations: The closed-form expressions for bias and variance can be efficient and accurate 
















































































































Matching and linear regression 3 

substitutes for Monte Carlo simulations for studying matching and linear regression. We have 
used these expressions for most of the simulations presented in this paper, and expect that 


future empirical studies in the field will benefit from this efficient simulation tool. 

3. Diagnostic and calibration software: Our expressions can be utilized in a diagnostic capacity 
to analyze bias-variance trade-off resulting from various choices of matching and regression 
parameters, and to quantify the impact of matching on study power. By utilizing such diag¬ 


nostic tools - embedded in the open-source R package MatchLinReg (Mahani and Sharabiani 


2015) - practitioners will be able to make better-informed decisions and the quality of causal 


inference in observational studies will be enhanced as a result. 


The rest of this paper is organized as follows. We continue by reviewing previous work on 
combining matching and regression for causal inference, introduce data sets and software used 
in the paper, and present simulation results using real data sets to highlight the benefits and 
complexities associated with combining matching and regression, which motivated our research. In 
Section we present our mathematical framework. We outline the scope and philosophy of the 
paper, formally define the problem, derive equations for TE bias and variance estimation using 
linear regression, and offer theorems regarding the impact of matching on bias and variance. In 
Sectionj^ we use the developed framework to create diagnostic and calibration tools. We begin with 
exploratory diagnostics such as relative squared bias reduction due to single omitted covariates, and 
comparison of normalized single-covariate biases, and proceed to generate an aggregate measure of 
normalized bias using constrained bias estimation methodology. We then combine normalized bias 
and variance to arrive at normalized MSE, using a conversion factor called ‘omitted R-squared’. 
Minimizing MSE allows us to select the matching parameters, a process which we refer to as 
calibration. We close this section by reviewing the open-source R package, MatchLinReg, that 
contains the core framework as well as the diagnostic and calibration tools. Einally, in Section 
we provide a summary of contributions in this paper, and outline pointers for promising future 
research directions. 

A list of mathematical symbols used throughout the paper can be found in Appendix 
Erequently-used acronyms are listed in Appendix Details of mathematical derivations and the¬ 
orem proofs are provided in the remaining appendices. 


1.1. Previous work 

There is a rich body of literature on causal inference for observational studies, and some have fo¬ 


cused - partially or wholly - on combined use of matching and regression adjustment. Rubin (1973) 
derive expressions for bias reduction for a single confounder, considering parallel and non-parallel 


response surfaces. Carpenter (1977) analyze the performance of nearest-neighbor matching, and 


its combination with regression adjustment. While their primary focus is on normally-distributed 
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covariates, they present an Analysis-of-covariance (ANCOVA) equation for variance, which is equiv¬ 


alent to our standard expression (Eq. 26b). Rubin (1979) study the effect of various approaches 


for combining matching and regression adjustment for controlling bias, and conclude that covariate 


matching combined with regression on paired differences performs best. Rubin and Thomas (1992 


1996) study bias reduction resulting from propensity score matching for normally-distributed co¬ 


variates. Their expressions rely on correlation between covariates and outcome, [not all of them; 


edit this sentence] Rubin and Thomas (2000) study the bias-removal impact of combining propen¬ 
sity score matching with regression adjustment and conclude that the combination has superior 


performance than regression alone. Ho et al. (2007) use real data sets to illustrate that using 


matching as a non-parametric pre-processing step reduces model dependence in parametric causal 
inference. While cautioning that increased variance from data loss can outweigh reduced bias in 
matching, they also point out that ‘no precise rule exists for how to make these [bias-variance 


tradeoff] choices.’ Schafer and Kang (2008) use simulated data based on Add Health study (Udry 


and Bearman, 1998) to compare the performance of various combinations of ANCOVA and propen¬ 


sity score matching techniques. Hosman et al. (2010) present methods for sensitivity analysis of 


regression, potentially combined with propensity-score-based stratification, to confounder omission. 


Abadie and Imbens (2011) prove that adding a bias-correcting step such as least-squares regression 


to nearest-neighbor matching renders it V^/^-consistent, but their analysis is asymptotic in nature, 
and is focused on matching with replacement which is typical of econometric literature. 


1.2. Setup 


We use the statistical software R (R Core Team, 2015) for all simulations presented in this paper. 


Two observational data sets form the basis of our simulations: lalonde ( [LaLond'e 1986) and lind- 
ner (]Kereiakes et al. 2000). Both data sets are available in the R package twang (Ridgeway et al. 


2015). The lalonde data set was collected to study the effectiveness of job training programs on 


future earnings of 614 participants (185 treatment, 429 control), with 8 adjustment covariates (4 
numeric, 4 binary). The lindner data set describes an observational study of the impact of Percu¬ 
taneous Coronary Intervention (PCI) on cardiac-related costs for 996 patients (698 treatment, 298 
control), given 7 covariates (3 numeric, 4 binary). 

Throughout the examples, we use PSM for matching (without replacement), using the R package 


Matching (Sekhon, 2011). Furthermore, as a concrete illustration of a calibration problem, we focus 


on identifying the caliper size used in matching (on propensity scores). However, our framework 
can be applied equally well to other matching techniques - such as Mahalanobis distance matching 
- as well as calibration problems, such as determining what terms to include in propensity score 
model, whether to use matching with or without replacement, etc. 

All simulation results presented in Section [^ utilize the closed-form expressions developed in 


















































Matching and linear regression 5 
Section All the core functions, as well as the diagnostic and calibration methods described 


in this paper are all available as part of an open-source R package, MatchLinReg (Mahani and 


Sharabiani, 2015). See Section 3.5 for an overview of MatchLinReg 


1.3. Complexities of TE estimation: Looking beyond double-robustness 

A common approach for combining matching and regression in statistical analysis is to use matching 


as a non-parametric pre-processing step before regression (Ho et ah, 2007). This approach allows 


practitioners to apply the familiar regression tools to their problem once matching is done, although 
it has been argued that issues such as calculation of standard errors must take into consideration the 


matching step (Austin, 2007). In linear regression with a treatment indicator binary variable and 


other (adjustment) covariates included, the coefficient of treatment indicator equals the treatment 
effect. Including adjustment covariates in regression helps remove residual bias due to imperfect 
matching. A special case is when no adjustment covariates are used in regression; we call this 
method the ‘simple difference’ method, since it is equivalent to calculating the mean outcome 
in treatment and control groups and subtracting them to obtain TE. As an alternative view, 
matching helps reduce sensitivity of regression adjustment to model misspecification (Section |2.4[ ). 
This mutually-reinforcing effect of combining matching and regression for TE estimation has been 


referred to as ‘double-robustness’ in the literature (Stuart, 2010). 


Figure illustrates the combined effect of matching and regression for TE estimation, using 
Monte Carlo simulations based on lalonde and lindner data sets. For each data set, noise variance 
as well as coefficients of original variables were extracted from linear regression on full data sets, 
and a quadratic term {re74? for lalonde and ejecfrac^ for lindner) was added to the generative 
model, with its coefficient chosen so as to achieve an omitted R-squared (i?^) of 5.9% and 3.3% for 
lalonde and lindner, respectively (40% of from regression on main effects.). quantihes the 
percent of variation in outcome attributable to covariates that are not included in the regression 


model. See Section 3.2 for mathematical definition. Regressions used in Monte Carlo simulations 


did not include this quadratic term, in order to simulate regression misspecification through co¬ 
variate omission (Section |2.2| ). As expected, matching and regression each signihcantly reduce bias 
(compared to simple difference method) in both data sets, and combining them further reduces 
bias. 

However, mean squared error (MSE) is the sum of squared bias and variance: 


MSE = (bias)"^ -|- variance 


( 1 ) 


While matching reduces TE bias in regression adjustment caused by covariate omission, yet it 
does so by discarding data points and thus reducing data size, leading to increased variance. As 
Figure [T] shows, while in lalonde data set the bias-reduction effect of adding matching to regression 
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lalonde 



lindner 


LD 



Fig. 1: Combining matching and regression for TE estimation in lalonde (left) and lindner (right) 
data sets. For each data set, error components (squared bias, variance, MSE) are compared using 
four approaches: 1) simple difference, 2) matching followed by simple difference, 2) regression 
adjustment, and 3) matching followed by regression adjustment. Horizontal line indicates minimum 
achievable MSE. Matching was done on propensity scores (on linear scale) resulting from logistic 
regression of treatment indicator on included covariates and (omitted) quadratic term, using a 
caliper of 1.5 (lalonde) and 0.2 (lindner). R package ‘Matching’ was used for matching. Numbers 
are based on 10,000 Monte Carlo simulation. For lalonde, outcome is re78 divided by 1000, while for 
lindner, outcome is the logarithm of cardbill. Omitted covariate is re74? for lalonde and ejecfrac^ 
for lindner, with coefficient of this omitted covariate chosen so as to achieve an of 5.9% and 
3.3% for lalonde and lindner, respectively. 
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more than offsets the increased variance such that MSE is reduced, the net effect for lindner is 
adverse, such that regression alone is the best option. This represent one of the key challenges in 
calibrating the matching-plus-regression combination, i.e. striking the right balance between bias 
reduction and variance increase so as to minimize MSE for TE. Also, it must be noted that despite 
the adverse effect of matching on variance, matching discards data quite efficiently and therefore 
increases variance slower than randomly discarding data (Theorem]^. 

Variance considerations are not the only source of complexity. Matching affects bias indirectly 
and only by adjusting the distribution of covariates across treatment and control groups. In deter¬ 
mining the ultimate impact of matching on TE bias, we face several complexities: 1) Bias induced 
by an omitted covariate is not a monotonic function of its imbalance (Figure]^ top row); 2) match¬ 
ing does not improve balance for all covariates, although it tends to do so for the most imbalanced 
covariates (Figure]^ middle row); 3) as a result of 1 and 2, matching does not improve bias due 
to all omitted covariates, but again it tends to reduce the largest biases (Figure]^ bottom row); 
4) Finally, in generating Figure]^ we assumed only one omitted covariate at a time, but in reality 
multiple such covariates can be missing from our regression model. What combination of these 
unknown terms, and their impact on TE bias, should be considered in how we combine regression 
and matching? The mathematical framework and diagnostics software developed in this paper seek 
to address the above challenges. 

Note that Equal-Percent-Bias-Reduction (EPBR) property of matching techniques (Rubin, 1976) 
such as PSM does not solve the complexities associated with TE bias: 1) EPBR is valid only when 
covariate distributions follow restrictions such as being multivariate normal, 2) EPBR only works 
for linear combinations of covariates, and not their nonlinear functions such as interactions and 
powers, which are our main candidates for omitted covariates, thanks to ‘ignorability of treatment 
assignment’ assumption (Section |2.1[ ). 

2. Framework 

In this section, we develop the mathematical framework for analyzing the combined effect of match¬ 
ing and linear regression for TE estimation. We begin by dehning the assumptions and notation 
used in the paper. 


(Rubin, 1976 


2.1. Approach and assumptions 


Ignorability of treatment assignment Also referred to as ‘unconfoundedness’ (Imbens, 2004) and 


‘conditional independence’ (Lechner 1999, 2002), this assumption states that, conditioned on the 


covariates available in our data set, treatment assignment is random (Rosenbaum and Rubin, 1983). 


In the context of a standard linear model (see below), this assumption means that all covariates 
in the generative model are derivatives (e.g. powers, interaction terms, splines, etc) of a core set 
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Fig. 2: Complex relationship between matching, covariate imbalance, and omission bias for lalonde 
(left) and lindner (right) data sets. Selection of coefficients for omitted covariates, and regression 
and matching methods and parameters are identical to FigureEquations of Section 2.3 are used 
for bias calculations. Top row: squared bias vs. absolute SMD for all second-order interaction terms 
(missing from regression model). Middle row: Comparison of SMD before and after matching for 
omitted terms. Bottom row: Comparison of square bias, before and after matching, for omitted 


terms. 
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of covariates included in the data set. This, in turn, means that perfect matching on the core 
set of covariates would automatically lead to perfect matching on the full set of covariates in the 
generative model. 


Matching as pre-processing for regression adjustment Similar to Ho et al. (2007) [more exam¬ 
ples] , we use matching as a pre-processing step before a parametric estimation of TE using linear 
regression. Combined with the parallel response surface assumption (see below), linear regres¬ 


sion on the full data set produces TE as the coefficient of treatment indicator variable (Imbens 


2004). Our approach can be considered an extension of two-group ANCOVA (Rubin, 1973 Car¬ 


penter, Quade 1982) to include the impact of regression misspecihcation parameterically 


and through the covariate omission concept. Notable alternatives include regression on matched 


pair differences (Rubin 1979), and separate regressions on treatment and control groups (Imbens 


2004). Using linear regression for parametric TE estimation allows us to derive closed-form ex¬ 


pressions for TE bias and variance (Section 2.3) for finite samples and without any parametric 
assumptions about the distribution of covariates in the treatment and control groups, or how their 
corresponding covariate matrices are related. Matching followed by a simple-difference calculation 
of TE, i.e. without regression adjustment, is a special case of the above framework, but with no 
adjustment covariates included in the regression model. In this case, the regression model has only 
two coefficients: intercept and TE. 

Parallel response surfaces This assumption states that the difference between functions describ¬ 
ing mean outcome for treatment and control groups is independent of the covariate vector. In other 
words, TE is constant for all observations, conditional and marginal/average TE are the same, and 
independent of sample. In a linear regression setting, this assumption means the generative model 
does not contain any interaction terms between treatment indicator and adjustment covariates, and 
correspondingly we do not include any such interaction terms in the regression model. While an 
important restriction, the parallel response surface assumption allows us to focus on developing the 
mathematical framework without concern for what relative weight to attach to error in calculating 
TE for different strata. 

Covariate omission as model misspecification Matching reduces potential TE estimation bias 
caused by a misspecihed regression model. In the linear regression framework - where terms can be 
nonlinear in original covariates - misspecification manifests itself in the form of covariate omission, 
i.e. one or more terms included in data generation being absent from the regression model. Impor¬ 
tantly, including nuisance terms in regression, i.e. that were not part of the generative model, does 
not induce bias, but only increases variance. 


Exclusion of outcome variable from analyses It is possible to use the outcome variable - im¬ 
plicitly or explicitly - during the matching calibration process. For example, one may focus on 
achieving better covariate balance for those covariates that show a stronger correlation with out- 
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come. While careful use of outcome variable can enhance the quality of entire modeling process 
including matching, in the hands of inexperienced practitioners it can be lead to overfit models 


with poor generalizability in real-world ( [Babyak 2004). We have therefore chosen to develop our 
diagnostic and calibration tools (Section]^ with complete exclusion of outcome variable from con¬ 
sideration. In addition to preventing overfit creep, this approach also makes our framework and 
tools ideal for cases where outcome data is not available yet and matching is used to select subjects 


for follow up (Stuart, 2010). 


2.2. Problem definition 

In this paper, we focus on causal inference for a Standard Linear Model (SLM), which consists 
of two components: a data generation (generative) model and a coefficient estimation (regression) 
model: 

1. Generative model: Outcome (y) is continuous and its mean is a linear function of covariates 
(X): 


E[y] =X/3. 


( 2 ) 


Data generation has an i.i.d. noise distribution: 

cov[y] = al I. 


( 3 ) 


Data-generation covariates consist of a binary treatment indicator (w), a unit vector (1), as 
well as a set of adjustment covariates (Z), none of which include interactions with treatment 
indicator (parallel response surface): 


X = 


w 1 Z 


( 4 ) 


w is 1 for treatment observations and 0 for controls. Correspondingly, the vector of coefficients 
(/3) consists of TE (r), intercept (/3o) and vector of coefficients for adjustment covariates ( 7 ): 

r 1 1 


(3 = 


T Po 7* 


( 5 ) 


For SLM, average TE for treated (ATT), sample average TE (ATE) and average TE for 
controls (ATC) are all the same and equal to the coefficient of treatment indicator variable, 
i.e. T. We refer to this entity simply as TE (treatment effect). 

2. Regression model: Ordinary-Least-Squares (OLS) is used to estimate model coefficients, in¬ 
cluding r. However, only a subset of covariates (X*) are included in the regression model. 
This subset includes w and 1, as well as AT* adjustment covariates (Z*), which we call ‘in¬ 
cluded adjustment covariates’ or simply included covariates. The remaining K° adjustment 
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covariates (Z°) are called ‘omitted adjustment covariates’, or simply omitted covariates: 


Z = 

X = 


z* z° 

w 1 Z* 
X* z° 


( 6 ) 

( 7 ) 

( 8 ) 


The vector of coefficients for adjustment covariates is correspondingly partitioned into included 
( 7 *) and omitted ( 7 °) vectors: 


7 = 
(3^ = 

P = 


7 





T (3o 7 *’* 


t 



(9) 

( 10 ) 

( 11 ) 


Ordinary-Least-Squares (OLS) estimation of coefficients leads to: 


(3 = (X*’‘X*)-^X*’V 


( 12 ) 


with 




f j3o 7 * 


1 1 


(13) 


Note that omitted covariates ( 7 °) are not estimated, since they are not included in the re¬ 
gression model. 


2.3. TE Bias and variance equations 

As mentioned earlier, TE estimation bias in regression adjustment is caused by model misspecihca- 
tion, which is manifested as covariate omission in our framework, i.e. when Z" 7 ^ 0, or equivalently 
when X 7 ^ X*. It is easy to check that a correctly-specified model is unbiased: 

X* = X ^ E[/3] = (X*X)^^X*E[y] = (X*X)~iX*(X/3) =/3, (14) 


where we have combined Eqs. [^and 12 With missing covariates, we have: 

E[^] - /3* = (X*’*X*)-^X*’‘ (X*/3* + ZV) - /9* = (X*’*X*)-i(X*’‘Z°) 7 °. (15) 


As for variance, we have the following standard expression: 

cov[^] = (X*’^X*)-^ 


(16) 


(In the case of repeated observations, e.g. in matching with replacement, the bias expression remains 
valid, while the variance expression must be modified. See Appendix |^) Given our convention 
regarding the order of covariates in Eq. 13 TE bias and variance are the hrst / top elements of the 


bias vector / covariance matrix, respectively, given in Eqs. 15 and 16 
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Next, we rewrite the above expressions in a way that highlights the role of covariate imbalance 
and the contribution from matching. Since Eqsjl5| and 16 both contain (X*’*X*)“^, we begin by 
transforming this matrix. From Equation we have: 


X*’*X* = 


w 

P 


w 1 Z* 


(17) 


Nt Nt w*Z* 

Nt Nt + Nc 1*Z* 
Z*’*w Z*’‘l Z*’*Z* 


(18) 


where we have taken advantage of w*w = l*w = w*l = Nt and 1*1 = X* + Nc- Since X*’*X* is 
symmetric, so is its inverse. To calculate TE bias and variance, we are only concerned with the 
first row/column of (X*’*X*)“^. In Appendixwe prove that: 

(P-9)*^”^“* 


(X*’*X*)-^ = 


V 


(19) 


/ 


where ti* is the vector of mean differences of included covariates, p and q are vector of sums of 
included covariates across treatment/all observations, and A is the pooled, within-group covariance 
matrix for included covariates: 


v}=N-(i- w)* Z* - ^ w* Z* 


1 

K 

p = Z*’*w, 


1 


q = Z*’*l, 

A=iNt- 1) cov(Z*)r + (Ae - 1) cov(Z*)c. 
Within-group covariance matrices are formally defined as follows: 

cov(Z*)t = (It - ^1t4) Z^, 

cov(Z*)c = (Ic - Zb, 


A,-l 


( 20 ) 

( 21 ) 

( 22 ) 

(23) 

(24) 

(25) 


where Z^ and Z^ are the row subsets of included adjustment covariate matrices for treatment and 
control groups, respectively. It and Ic are identity matrices of dimensions Nt and Nc, and It and 
Ic are unit vectors of length Nt and Nc, respectively. 

TE bias is the first element of the bias vector in Eq. [Tsl which is the result of multiplying the 
first row of (X*’*X*)“^ X*’*Z° by 7 °, while TE variance is simply (Tq multiplied by the top element of 
(X*’*X*)“^ (Eq. 16). Using Equation 19, we obtain the following expressions for bias and variance: 
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+ u^'^a-^z^Az°y. 


2 2 
a =a. 




(26a) 

(26b) 


We refer to Equations 26a and 26b as the standard equations for bias and variance. As shown in 
Appendix [D| the above expressions can be transformed into normalized forms: 

1/2 


5 = 


NtN, 


(Nt + N,){Nt + iVe + 1) 


{p°’‘ + p*’‘A-i(p>°’* - $“)} 


V*) = f^Ln(l + P*’‘A V*) 


(27a) 

(27b) 


where p is the vector of treatment-covariate correlations, is the cross-correlation matrix between 
included and omitted covariates, A is the weighted, pooled correlation matrix, El* and S° are the 
diagonal matrix of variances for included and omitted covariates respectively, and is the 

minimum achievable variance (Theorem]^. These symbols are mathematically defined as: 


p* = [corr(w,z(;.)]fc=i_„.^ 7 i'. 
p° = [corr(w,z°fc)]fc=p,,.,ii'o 

= [corr(z(^,^,z%)]fc^=i„..,^.,fe,=i,..„i^o 

A = 5 ]*--i/2 a + Nc-l) 

S* = [C0v{z\, 6k,,k2]k^,k2=l,...,K^ 

S° = [cov{z%,z%)6k„k2]kuk2=i:...,K‘> 

^2 _ ^ 2 / 1 1 \ 

- ^0 1 ^^ + 


(28) 

(29) 

(30) 

(31) 

(32) 

(33) 

(34) 


The normalized equations make it clear that TE variance is independent of shifting and scaling of 
covariates. More importantly, the normalized bias expression of Eq. |27a makes the role of covariate 
balance and matching much easier to discern (Theorem [^. 

The vector of mean differences, u^, can be normalized to obtain a vector of ‘standardized mean 
difference’ values for each included covariate, d*. These vectors are proportional to the vector of 
treatment-covariate correlations, p®: 


d* = 
P* = 
P* = 


-S* 


NtNc 


1/2 


(W + A'c)(lVt + A'c-l), 

NtN, ^ 

{Nt + N,){Nt + N,-1) 




(35) 

(36) 


d*. 


(37) 
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Eq. 36 is proven in Appendix and Eq. 37 follows from Eqs. 35 and 36 Similar relationships 


IS 


hold for the corresponding vectors of omitted covariates, u°, d° and p°. A corollary of Eq. 37 
the following: 

Corollary 1. If a covariate is balanced across treatment and control groups, it is uncorrelated 
with the treatment indicator variable. 

Proof. Balance for covariate k means its standardized mean difference, dk, is zero. This mean, 
according to Eq. 37, that pk is zero. □ 


Having derived the standard and normalized expressions for TE bias and variance using linear 
regression, we are now ready to study the impact of matching on each error component. 


2.4. Impact of matching 

The impact of matching on TE estimation bias and variance is captured in the following two 
theorems. 

Theorem 1. (Matching and bias) In a standard linear model, TE estimation is unbiased if all 
included and omitted covariates have equal means aeross treatment and control groups. 

Proof. The premise means d* = d° = 0. This, according to Corollary[^ means that = p° = 
0. From Eq. 27a we conclude that <5 = 0. □ 


While matching reduces TE estimation bias, it does so by discarding data and thus may increase 


variance. As discussed in the literature (Ho et al. 2007 Stuart, 2010), matching does not increase 


TE variance as rapidly as random subsampling of data would. The following theorem formalizes 
this intuitive and empirical notion. 

Theorem 2. (Matching and TE variance) Among all data sets with a given treatment (Nt) and 
eontrol (Nc) group size, generated from a standard linear model with noise variance of Uq, a data 
set with balanced distribution of included eovariates achieves the lowest (potentially misspecified) 
OLS-based estimation variance for TE, and this minimum variance equals (Tq (l/Ac + l/lVt). 

Proof. Since within-group covariance matrices, cov(Z*)r and cov(Z*)c', are positive semi- 


definite (Gentle, 2007), their weighted sum in Eq. 23 is also positive semi-definite. Since A is 


positive semi-definite, A~^ is positive definite (with eigenvalues being inverse of eigenvalues for 
A). Therefore, by definition of positive-definiteness, > 0, Vu* ^ 0. The equality hap¬ 
pens when u* = 0, i.e. when included covariates are balanced, leading to As shown 

in Appendix , sampling with replacement increases variance, and thus the minimum variance 
established in this theorem applies to matching with replacement as well. □ 
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A few observations are worth mentioning with regards to the bias/variance expressions and the 
resulting theorems: 

1. For TE estimation to be unbiased, not only included but also omitted covariates must be 
balanced. Since we don’t know what omitted covariates are (otherwise we would include them 
in regression and eliminate bias), we generally cannot ensure they are balanced. An exception 
is perfect matching where, e.g. each treatment observation is perfectly matched - with respect 
to all original covariates - with a control observation. But thanks to ignorability assumption, 
we are ensured that all omitted covariates are also functions of included covariates. Therefore, 
perfect matching on included covariates would automatically lead to perfect matching on 
omitted covariates. But perfect matching is often impractical, either because some covariates 
are continuous, or because perfect matching leads to an excessively small sample size and 


hence large variance, or both. Rubin (1973) similarly observe that matching on covariates 


in a linear regression or ANCOVA analysis does not guarantee bias removal for nonlinear 
response surfaces. 

2. In the absence of perfect matching, we must accept the fact that TE will be biased, and instead 


do our best to minimize it. Again, our main challenge is that, according to Eq. 27a, bias is 
a function of included as well as omitted covariates. This represents a key difference between 


bias and variance equations, since variance, according to Eq. 27b only depends on included 
covariates. In other words, quantifying bias is a more difficult proposition than quantifying 


variance. In Section 3.2 we will present a framework for quantifying bias that is independent 
of outcome variable. 

3. In Theorem we showed that a data set that is balanced with respect to included covariates 
achieves A second way to achieve minimum variance is to simply include no adjustment 

covariates in the regression, i.e. only have intercept and treatment indicator. This is the simple 
difference method mentioned in Section 1.3 As we saw in Figure while this approach has 
low variance, using it with an unmatched data often produces a large enough bias that more 
than offsets the low variance and leads to large MSE. 


2.5. TE bias and orthogonality 

Neither the standard (Eqs. 26a and 26b) nor the normalized (Eq. |27a and 27b) expressions for 
bias make an intuitive concept clear: An omitted covariate that is very ‘similar’ to the included 
covariates will not induce much bias, since its contribution to the outcome will be captured by the 
coefficient of the included covariate acting as a surrogate for the coefficient of omitted covariate. 
This concept can be formally stated in the following theorem: 


Theorem 3. (Bias and orthogonality) Projections of omitted covariates onto the subspace spanned 
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by included adjustment covariates and intercept variable produce zero TE bias, i.e. only components 
of omitted eovariates orthogonal to the aforementioned subspace eontribute towards bias. 


Proof. Equation can be expanded in terms of columns of Z: 

K° 

E[/3] - /3 = (38) 

k=l 

To prove our theorem, and recalling our convention for arranging covariates in X* according to 
Eq. we simply prove that the matrix operator (X*’^X*)“^X*’* acting on any one of K columns 
of Z*, i.e. z*^, produces a X-dimensional vector whose first element is zero. This is easy to see 
since, by definition, (X*’*X*)“^X*’*X* = 'i-K^+ 2 ^ where is the identity matrix of dimensions 

X* + 2. Therfore, (X*’*X*)“^X*’*z®^ equals the (A: + 2)’th column of the identity matrix 1^+2, whose 
first two elements will always be zero. Therefore, omission of any linear combination of z®^’s will 
produce zero TE bias. □ 


This theorem will be used in Section 3.2 to develop the constrained bias estimation approach. 


3. Applications 

In Section we developed the mathematical framework for quantifying TE bias and variance, as 
well as the impact of matching, in a standard linear model (Section |2.2[ ). In addition to providing 
a theoretical basis for understanding the combined use of matching and linear regression, this 
framework can be utilized towards developing diagnostic and calibration tools for causal inference. 
Such applications are the subject of this section. 


3.1. Efficiency and generaiizabiiity of simuiations 

The first application of the framework developed in Section particularly the closed-form expres¬ 
sions for bias and variance (Eqs. 26a 26b, 27a, and 27b[ ), is that it allows for fast and accurate 
simulations of the combined effect of matching and variance without requiring lengthy Monte Carlo 
simulations. Figure [^compares the bias and variance calculations using our closed-form expressions 


- encoded in the R package MatchLinReg (see Section 3.5) - as well as Monte Carlo simulations, 
using 10,000 iterations to generate each point. While closed-form expressions produce results nearly 
instantaneously, the MC approach took several minutes on an average laptop, despite the data sets 
being quite small (fewer than 1000 observations). For larger data sets, the advantage of our frame¬ 
work becomes more prominent. Such efficient simulation tool allows for exhaustive exploration of 
data sets and parameter spaces in future research, ultimately leading to better empirical rules for 
combining matching and linear regression. 

In addition to speed of simulations, an equally-important contribution of our framework is that 
it provides for better generaiizabiiity of simulation results since we have a theoretical basis for what 
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Fig. 3: Comparison of TE bias and variance calculations using Monte Carlo simulations (10,000 
iterations per data point) and closed-form expressions of our framework for lalonde (left) and lindner 
(right) data sets. MC calculations took ~10-15min, while closed-form calculations took less than 
O.lsec. Data generation parameters are the same as those reported for Figure 


are the drivers of each error component. For example, Eqs. 26a and 26b make it clear that TE bias 
and variance do not depend on coefficients of adjustment covariates included in the regression model. 
Therefore, this aspect of data sets can be safely ignored while comparing their simulation results 
across studies. The enhanced generalizability of simulations resulting from utilizing our framework 


directly addresses the requirement stated in Imbens (2004). This paper is the first beneficiary of 


the framework, as all the simulations in the rest of this paper utilize the aforementioned closed-form 
expressions. 


3.2. Quantifying bias 

A key challenge in assessing the benefits of matching is that TE bias is a function of the unknown. 


omitted covariates and their coefficients (Eqs. 26a and 27a) and hence any bias-removal impact of 
matching is similarly unknown, and hard to quantify. Naively, it might seem like a reasonable idea to 
use ‘data exploration’ to identify such omitted covariates and the strength of their impact, e.g. using 
pairwise correlation analysis of various candidate terms with outcome. Such explorations, however. 


can lead to overfitted models, particularly in the hands of inexperienced practitioners (Babyak 


2004; Ho et ah, 2007). Furthermore, if we could somehow learn what the omitted covariates are. 


we could simply include them in the regression model and eliminate the bias that way, rather than 
using matching. Instead, we opt for a dual approach: 
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1. Bias-oriented analysis: We calculate maximum TE bias, normalized by the upper bound on 
the amount of unexplained variation in outcome that is due to omitted covariates, and where 
the vector of contributions from omitted covariates lies within a pre-specihed subspace. This 
normalized bias is then combined with TE variance over a range of values for the bias-to- 
variance conversion factor, which we call ‘omitted R-squared’. This analysis begins with a 
set of diagnostic tools (Sections |3.2.1 3.2.2 and 3.2.3), and culminates in a calibration tool 
(Section |3.3[ ). As it emphasizes bias, this analysis favors matching. 

2. Variance-oriented analysis: We assume the other extreme with regards to the strength of 
omitted covariates, namely that it is zero. Assuming a correct model specification - and 
hence zero bias - we calculate the reduced study power resulting from the choice of matching 
parameters in previous analysis (Section |3.4[ ). This represents a worst-case-scenario as far as 
the negative impact of matching on TE estimation error. 


Practitioners must weight the benefits of matching (from the first analysis) against the worst- 
case-scenario from the second analysis to determine the best course of action. An interesting area 
for future research is to combine these two facets into a unified framework, possibly using risk- 
based Bayesian methods to offer an even more prescriptive path towards matching calibration for 
practitioners. 


With the above roadmap in mind, we begin developing a methodology for quantifying TE bias 
in misspecified regression. Eq. |26a can be re-written as follows: 


<5 = g*zv 


K° 




(39) 


where 




1 


1 


w+(-— + — (p- qfA-^ u^]l + 


(40) 


39 


means 


and is the /c’th omitted covariate, 7 ^ is its coefficient, and g is a vector of length N. Eq. 
that the contribution of each omitted covariate towards TE bias is 1) additive, 2) proportional to 
its coefficient, and 3) dependent on how parallel it is with g. 


While we do not know exactly what the omitted covariates are (otherwise they would have been 
included), we can often form a reasonable candidate set, e.g. by forming interactions (second-order 
or higher) and powers of the included covariates. Given the additive property of bias, studying the 
impact of matching on bias due to each potentially-omitted covariate is a meaningful first step. 
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Fig. 4: Relative squared bias reduction for all second-order terms as candidate omitted covariates. 
Results for lalonde (left) and lindner (right) data sets are shown. Regression adjustment using 
all main effects (‘before’) is supplemented by propensity score matching as a pre-processing step 
(‘after’), using caliper size of 1.5 for lalonde and 0.2 for lindner. Omitted terms are sorted by 
decreasing relative bias reduction. Terms in red are those for which matching has increased squared 
(or absolute) bias. 


3.2.1. Single-covariate relative squared bias reduction 

A first diagnostic tool is to compare (squared) bias caused by a candidate omitted covariate, before 
and after matching. A relative bias reduction metric will be independent of 7 ^: 


Sr:i - 






\a.t 


(41) 


where i and / superscripts refer to before and after matching, respectively. This quantity (or its 
square) can be produced for all candidate omitted terms, producing plots similar to those shown 
in Figure]^ Ideally, we would like matching to eliminate, or at least reduce, bias for all terms, but 
the figure shows that this may not happen. Practitioners may need to examine terms for which 
bias increases more closely, including performing the next analysis. 


3.2.2. Comparison of single-covariate normalized biases 

While relative bias reduction is calculated independently of omitted coefficients ( 7 °), yet it does not 
provide any comparison of the magnitude of biases induced by each omitted covariate. Matching 
may not significantly reduce the bias due to an omitted covariate, yet if the overall bias contributed 
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by that covariate is likely to be small, lack of bias reduction by matching for that covariate becomes 
unimportant. 

Since each term’s contribution is proportional to its coefficient ( 7 ^), we must somehow determine 
these coefficients. As emphasized before, we opt for an approach that does not use the outcome 
variable. If omitted covariates have been orthogonalized with respect to {1,Z*} (the subspace 
of included adjustment covariates plus intercept), their mean squared contribution to outcome is 
roughly the same as the change in R-squared of the regression model, multiplied by cJq. This 
motivates us to define a parameter called ‘omitted R-squared’: it is the ratio of mean squared 
contribution to outcome for an omitted covariate (or a set of covariates), after orthogonalization, 
to noise variance: 


R 


2 

O 


Wi.Y\\V4 


(42) 


This parameter can be intuitively described as the percentage of variation in outcome that is not 
explained by the regression model. If included covariates are all linear terms (in original covariates), 
then one can also think of R^ as the degree of nonlinearity in the data-generation process. Note that 
we do not propose that this parameter be extracted from the data, e.g. through the use of outcome 
variable. Rather, this parameter must be determined based on a domain expert’s knowledge of the 
underlying mechanisms involved as well as experience and future simulation studies. We will use 


this parameter in Section 3.3 to put bias on the same scale as variance and combine them to arrive 
at total MSE for TE. 

Given the above, we propose a constrained bias estimation approach where we estimate bias 
subject to the constraint that mean squared contribution from omitted covariates equals a given 
value. In fact, we can canonically set this value to 1 and estimate a normalized bias value: 




(43) 


Normalized bias has an intuitive interpretation: it is the fraction of unexplained/omitted variation 
in outcome (or ‘signal’) that turns into TE bias. Eor a perfectly-matched data set, this fraction is 
zero. 

Within this general approach, several variations can be conceived of. We begin with the simplest 
approach, where we calculate and compare normalized bias for each of the candidate omitted 
covariates. In other words, we permit 7 '° to have exactly one non-zero element. This produces K° 
normalized bias numbers, 5^: 

= (44) 

We can now produce and compare these single-covariate, normalized biases for a set of candidate 
omitted terms, perhaps before and after matching, and for different calibrations of matching to 
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identify nonlinearities that can potentially induce large residual bias, even after matching. Since 
absolute values are more important than signs, we focus on squared bias, a metric which is also 
dimensionally compatible with variance and total MSE. 

An example is provided in Figure where single-covariate normalized square bias has been 
plotted for a handful of second-order interactions as a function of matching caliper size. We see 
that matching does manage to reduce the normalized bias for the biggest offenders. A notable 
exception is the female:stent interaction in the lindner data set, where matching has raised the bias 
to relatively significant level after matching (~ 1% of omitted variance). Also, note that the overall 
normalized bias levels are significantly higher in lalonde data set, compared to lindner. In other 
words, a larger fraction of omitted signal translates into TE bias in lalonde data set. Figure also 
illustrates that impact of matching on covariate imbalance (measured via absolute mean difference) 
does not have a simple relationship with bias reduction. 


3.2.3. Estimating aggregate bias 

While producing and examining normalized squared biases for interaction terms is insightful, it 
stops short of producing a single, aggregate measure of bias, which we would need to combine with 
variance to arrive at an estimated MSE. We consider three aggregation methods below. They all 


produce a normalized bias estimate (Eq. 44), and are all based on maximizing normalized bias 
within a given, eligible subspace. They differ in how this eligible subspace is constructed. 

Single-covariate maximization: Eligible subspace is one of K° candidate omitted covariates. 
The aggregate bias estimate is simply the maximum value of normalized bias terms due to each 
candidate, omitted covariate, {5”^}. 

Covariate-sub space maximization: Here the eligible subspace is the entire covariate subspace 
that is orthogonal to {1, Z*}, and the aggregate bias estimate is the linear combination of omitted 
covariates - after orthogonalization - that maximizes normalized bias. This bias estimate is bound 
to be greater than or equal to the output from single-covariate maximization approach. The 
direction of must be parallel to the projection of g in the omitted covariate subspace, after 
orthogonalization (gn): 


<5 = \/]V g*g||/(g||g||)^/^ 


(45) 


Absolute maximization: The eligible subspace is the entire subspace orthogonal to {1,X*}, and 
the aggregate bias estimate in this case is the absolute maximum normalized bias achievable. Here 
the direction of Z°'y° is simply parallel to g: 
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Fig. 5: Top row: Comparison of normalized squared biases due to omission of 10 second-order terms 
from regression adjustment for lalonde (left) and lindner (right) data sets. Horizontal axis is caliper 
size used in propensity score matching, with the right-most point on each plot corresponding to no 
matching. Bottom row: Impact of matching on absolute mean differences for same 10 interaction 


terms. 
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Fig. 6: Impact of matching and caliper size on TE bias using single-covariate maximization, 
covariate-space maximization, and absolute maximization for lalonde (left) and lindner (right) 
data sets. 

Figure compares the bias estimated for these three methods, along with impact of matching 
on bias reduction for each method. Of these three measures, we believe the covariate-subspace 
approach is the most suitable approach. While absolute maximization is too extreme as far as 
bearing no connection to adjustment covariates, single-covariate maximization approach is too 
rigid in focusing on a very limited set of possible combinations of candidate omitted covariates. 
Interestingly, bias estimate using absolute maximization does not change with matching. This 
property is proven in Appendix [E| 

It is possible to replace maximization with another operator - such as averaging - in each of the 
above methods. The fact that using the maximum operator tends to exaggerate the impact of bias 
motivates us to present a counter-view in which we assess the damage caused by matching towards 
variance increase and the resulting loss of study power, assuming a total absence of bias. This is 
described in Section [TH 

3.3. Combining bias and variance 

In Section |3.2| we developed the constrained bias estimation method for arriving at ‘normalized 
squared bias’, i.e. the ratio of squared bias to omitted signal. On the other hand, Eqs. |26b| and |27b] 
allow us to calculate ‘normalized variance’, i.e. the ratio of TE variance to Cq, in a straightforward 
manner. In order to combine normalized bias and variance to arrive at MSE, we must have an 
‘exchange rate’. We propose an intuitive parameter, ‘omitted R-squared’, which we simply define 
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to be the ratio of omitted signal to generative noise. If we define ‘normalized MSE’ as the ratio of 
MSE to ctq, we have: 

normalized MSE = normalized variance + x normalized squared bias (48) 

Eigurej^ shows normalized MSE as a function of matching caliper size for various values of for 
lalonde and lindner data sets. For small values of variance dominates bias and thus optimal 
caliper size (that minimized MSE) tends to be large (or we may choose no caliper, or no matching) 
(top row). As gets larger, bias and variance become comparable and thus the bias-removal 
impact of matching outweighs its variance increase; thus optimal caliper size shifts towards smaller 
values (second row). For even larger (third row), bias dominates variance and even smaller 
caliper sizes become optimal. This trend is seen more clearly in the bottom row, where optimal 
caliper size has been plotted as a function of F(^. These plots can be considered the most prescriptive 
of all our analyses so far: they suggest a particular calibration (caliper size here) assuming different 
values of omitted R-squared. Same type of analysis can be applied to other aspects of matching, 
e.g. what terms to include in matching, propensity score matching vs. Mahalanobis matching, etc. 

How should practitioners go about selecting a value (or a range of values) for omitted R-squared? 
The answer is, partly domain expertise, and partly experience. In particular, if outcome data is 
available, practitioners can calculate R-squared for a main-effect-only regression model, and then 
reason about how much they expect nonlinear effects to improve upon this R-squared. For example, 
assume that the main-effect R-squared is 14%, and that we believe nonlinear effects contribute about 
30% towards explaining data variance. Then we can reasonably assume that = 30% x 14% ~ 4%. 
Some researchers and practitioners may approve of such a limited use of outcome data. 


3.4. Power analysis 


In covariate-subspace maximization approach to constrained bias estimation (Section 3.2.3), we 
erred on using high bias estimation in our bias-variance trade-off analysis by using the maximum 
operator on the eligible subspace. In this section, we take the opposite view by assuming no bias, 
and assess the negative impact of matching in terms of variance increase, or equivalently study 
power decrease. 

Figure shows study power as a function of caliper size for lalonde and lindner data sets. 
Effect size is defined as ratio of TE to noise standard deviation, i.e. r/ao and set to 0.3 for 
both plot. In each case, we used 1000 Monte Carlo simulations to generate mean and standard 
deviation values for study power. For comparison, we have also plotted study power when data is 
subsampled randomly with the same sizes as the matched case from treatment and control groups. 


The significantly higher study power for matched subsamples is an illustration of the minimum- 
variance property of matching (Theorem]^. 
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Fig. 7; Matching calibration by combining bias and variance and minimizing MSE, for lalonde 
(left) and lindner (right) data sets. Top three rows: TE error and its components as a function of 
matching caliper size, for three levels of omitted R-squared. Bottom row: Optimal caliper size as 
a function of omitted R-squared. 
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Fig. 8: Impact of matching caliper size on study power for lalonde (left) and lindner (right) data 
sets. Effect size is defined as ratio of TE to standard deviation of noise (cr). Its value was set using 
regression on data sets, using main effects only, [add numbers for effect size] 

[we probably need another sentence or two summarizing what we have done so far. also, elaborate 
on this MC based approach being the first open-source function specialized in TE estimation and 
not just general regression.] 


3.5. Software: R package MatchLinReg 

All the diagnostic and calibration tools presented in this section have been packaged into an open- 


source R library, MatchLinReg (Mahani and Sharabiani, 2015). To allow researchers to develop 
custom diagnostic tools based on their domain-specific needs, the core set of functions in the package 
have also been made public, in addition to the higher-level wrappers that implement the practical 
tools. Below we briefly describe the structure of the software, and refer the interested reader to 
the package documentation for details. Below we provide a brief description of core and wrapper 
functions. 


1. Core module: Eunctions for calculating bias and variance, finding orthogonal and parallel 
projections of a vector in a subspace, and Monte-Carlo based power calculation for causal 
inference. Eigure provides a summary of how MatchLinReg implements our framework for 
calculation of bias and variance and combining them to produce MSE for TE. 

2. Diagnostic/calibration module: Eunctions for propensity score and Mahalanobis matching 
(wrapper around Matching R package), combining bias and variance over a vector of value 
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Fig. 9: Process diagram for calculating total MSE for TE, using MatchLinReg R package. Boxes 
represent library functions, [describe symbols] 

for omitted R-squared, and performing diagnostic and calibration analyses described in this 
section. 


4. Discussion 

4 . 1. Summary 

In this paper, we developed a framework for quantifying the combined effect of matching - as 
a non-parametric pre-processing technique - and linear regression for estimating TE for causal 
inference. In addition to providing a theoretical basis for the impact of matching on TE bias and 
variance in a misspecified linear regression, the applicability of our framework is strengthened by 
four attributes: I) a finite-sample focus (as opposed to large-sample or asymptotic analysis) makes 
our results directly applicable to real-world problems with small control and/or treatment groups, 2) 
quantifying not just bias but also variance of TE estimation allows us to minimize total MSE, which 
is the ultimate measure of estimation accuracy in a single experiment, 3) exclusion of the outcome 
variable from the equations ensures that the use of our diagnostic and calibration tools by the 
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broad community of practitioners does not lead to overfit results that have questionable prognostic 
value [move overfitting to beginning of sentence], and 4) capturing the entire framework along 
with calibration and diagnostic tools in an open-source software, R package MatchLinReg, allows 
researchers and practitioners to utilize the currently provided set of tools for their observational 
research, extend and experiment with the framework, and implement new diagnostic and calibration 
functions that better suit their application domains. 

4.2. Future research 

Some of the limitations and assumptions of the current work provide for interesting research op¬ 
portunities. Below we highlight two key areas; 

1. Generalized Linear Models (GLMs): The current framework is focused on, and takes full ad¬ 
vantage of, linear models with a continuous (unbounded) outcome variable. This facilitated 
the derivation of closed-form expressions for bias and variance. Maximum-Likelihood (ML) es¬ 
timation of GLM coefficients does not generally contain a closed-form solution. The nonlinear 
link function creates complex interactions between the coefficients of adjustment covariates 
(not just omitted but also included) and TE bias and variance, [add literature on matching, 
bias, conditional vs marginal] 

2. Large regressions as alternative to matching: An alternative to matching for removing covari¬ 
ate omission bias is to simply include all candidate omitted terms in the regression model. Any 
nuisance covariates, i.e. those not present in generative model but included in the regression 
model, do not contribute to bias - as we can simply assume their corresponding coefficients 
are zero - but only increase variance. R may appear that increased variance due to inclusion 
of all candidate terms may overwhelm the regression, thus becoming an unrealistic candidate. 
However, our early simulations and analysis suggest that this may not be the case, and this 
topic deserves further research. 
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A. List of mathematical symbols 


1. X: Design matrix, matrix of covariates, or matrix of explanatory variables (including intercept 
and TE). No distinction is made between included and omitted covariates. 

2. 1: Vector of length N, with all elements equal to 1. This corresponds to the ‘intercept’ 
covariate. 

3. w: Binary vector of length N, indicating whether an observation belongs to treatment group 
(1) or control group (0); also known as ‘treatment indicator’. 

4. Z: Matrix of adjustment covariates, which is the full design matrix with 1 and w columns 


removed: X = 


w 1 Z 


5. ZyZ®: Subsets of Z, consisting of adjustment covariates included in / omitted from the 


regression model: Z = 


Z° 


6. X*: Matrix of included covariates (which always includes 1 and w): X = 


X* Z° 

1 1 

XAT, 


; similarly 
; similarly 


7. x„^: A column vector, extracted from the n’th row of X: X = xi 
defined for Z. 

8 . X fc: A column vector, extracted from the fc’th column of X: X = x i ... x k 
defined for Z. 

9. Xnk'- A scalar, /c’th element of x^; similarly defined for Z. 

10. y: Response (column) vector, assumed to be continuous (linear regression). 

11. N/Nt/Nc- Number of observations in entire data set / treatment group / control group. N 
must be equal to length ofr y and number of rows in X. We also have N = Nt + W. 

12. /3o: Intercept term, or coefficient of covariate 1 . 

13. Wn- n’th element of w. 

14. r: TE, or coefficient of covariate w. 

15. (3: Vector of all coefficients, including intercept and TE. 

16. 7 : Vector of coefficients for adjustment covariates (Z). 

17. f//3o//3/7: Counterparts to r// 3 o// 9 / 7 , but represent estimated values from Ordinary Least 
Squares (OLS) estimator. 
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18. 

19. 


20 . 

21 . 

22 . 

23. 

24. 

25. 


26. 

27. 

28. 

29. 

30. 


Mahani et al. 

T/C: Set of observation indexes belonging to treatment / control groups; T n C = 0 and 

ruc = {i,2,...,iv}. 

p!! p°\ Vector of correlation coefficients between all/included/omitted adjustment covari¬ 
ates, and treatment indicator vector; p = [coYv{\^,z^k)]k=i,...,K-, and similarly for p* and 

Correlation matrix for included adjustment covariates; = [corr(z*^^, 

Correlation matrix between included and omitted adjustment covariates; 

= [corr(z*^^,z;j,J]fc,=i,...,i^',fc,=i,...,i^. 

A; ‘weighted correlation matrix’ for the included covariates. 

p/p^/p°: Vector of means for all/included/omitted adjustment covariates. 

Matrix of diagonal elements of covariance matrix for all/included/omitted adjust¬ 
ment covariates; S = [cov(z^fcj, z^^^) (5fci,A:2 ]A:i,fc 2 =i,...,if'-x, and similarly for X)* and 51“. 

Data balance vector for all/included/omitted adjustment covariates, defined as 




Nr 


'Nt Nr 


(49) 


and similarly for ti* and u°. 

e: Vector of length N, representing the residual noise in generative linear model. 

Noise covariance matrix; 'I' = cov(e). 

5: TE estimation bias, i.e. 5 = E[f] — r. 

(T^; TE estimation variance, i.e. ci^ = E[(f — E[r])^]. 

d/dYd°; Vector of standardized mean differences for all/included/omitted adjustment co¬ 
variates. 


Notes; 


1 . 

2 . 

3. 


Eor notational brevity, we use symbols r//3o//3/7 to represent both variables that are op¬ 
timized in error function, and the true, generative model used to produce the data. The 
distinction should be clear from context, and is clarified in the text if not. 

Covariates are arranged in the full design matrix in a particular order to simplify representa¬ 


tion; X = 


w 1 Z* 


. This does not reduce generality of results. 


Similarly, without loss of generality, we assume that all treatment observations occupy the 
first Nt rows of X (and other corresponding structures), and control observations occupy the 
last Nr rows of X. 


B. List of acronyms 

1. ANCOVA; Analysis of Covariance 

2. ATC; Average treatment effect for controls 
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3. ATE: Sample treatment effect 

4. ATT: Average treatment effect for the treated 

5. MSE: Mean Squared Error 

6. OLS: Ordinary Least Squares 

7. PSM: Propensity Score Matching 

8. SMD: Standardized Mean Difference 

9. SLM: Standard Linear Model 

10. TE: Treatment Effect 


C. Calculating the first row/column of (X* 

Assume that 


(Xhix*)-i = 


a b 

b . 

V . 


(50) 


Using Eqs. 18, 21 and 22, and applying the definition of matrix inversion, we obtain: 


-1 

_1 


a b 


1 . 

Nt Nt + Nc 9* 


b . 

= 

0 . 

p 9 R 


V . 


0 . 


(51) 


where we have defined: 


R EE Z*’*ZL 


(52) 


Next, we expand the expressions producing the first column of the right-hand side (identity matrix) 
to get: 

/ 

Nta + Ntb + V = 1, 

Nta + {Nt + Nc)b + q^v = 0, 

pa + qb + Rt; = 0. 


(53) 


Solving for a and b (in terms of v) between the top 2 sub-equations in 53 leads to: 

1 


1 1 

CL — — -|- — H" - 

Nt Nc NtNc 


{Ntq-iNt + N,)pfv, 




Substituting the above back into the last sub-equation of[^ we get: 

1 


V = 


A-^{Ntq-{Nt + N,)p), 


NcNt 


where 


A = Z*’* Z* -|- —pq^ — (— -I- —) -|- — qp^ -— <7 9* 

Nc ^Nt nJ Nc Nc 


(54) 

(55) 

(56) 

(57) 
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Substituting the above solution for v back into Equation we obtain the following expression: 

(58) 

(59) 

(60) 




V = A-^u\ 


where it® is defined in Eq. 20 Putting it all together, we get the following: 


/ 


(X®’*X®)-^ = 


_ + ^ + u^,tA-W ^ (p - qfA-^ It® ii®'*A-i 

t A -1 , 


V 


-ik + ik (p- 9) A ^u® 

A~^ ii® 


(61) 


... 


where we have taken advantage of A being symmetric. Arriving at Eq. 23 involves some routine 


algebraic manipulation of Eq.|57| while taking advantage of the definitions of within-group covariance 


matrices in Eqs. 24 and 


Note that, when included covariates are balanced (u® = 0), then the above expression becomes: 

/ A I J_ _J_ 0 \ 

M. 'AT M 


(X^x®)-! = 


Nt ^ We 

'We 


V 


(62) 


/ 


D. Normalized bias and variance equations 

We seek to derive the normalized expressions for bias and variance (Eqs. |27a and 27b) from the 


standard forms (Eqs. 26a and 26b). To transform the variance equation, we first prove Eq. 36, show¬ 
ing that the the vector of correlations between treatment indicator (w) and included adjustment 
covariates (Z®) is proportional to the vector of mean differences for Z®. 

Focusing on a single covariate vector (z) of length N, its correlation (p) with treatment indicator 
(w) is given by: 

N 


1 


P = 


{N - 1) az cr^ 

1 


'^{Zn- < Z >)(lC„- < wr >), 


n=l 


-{N < > —N < z >< w >}. 


{N - 1) cr^^7^, 

Taking advantage of the definition of w, it is easy to verify that 

t At 

<wz> = — <Z >T, 

Nt 

A’ 


^2 _ 


NtN, 


N{N -1)' 


(63) 

(64) 

(65) 

( 66 ) 
(67) 
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Combining Eqs. 64 65 66|and[67| we obtain: 

Ari/2(Ar_ 1)1/2 


Nf Nf 

P = /-;- 1/2 172 ^^5^ < Z >T -N— < z >} 


= (r 


NNt 


Z >T — < Z >}. 


■(iV-l)/Vca 2 

On the other hand, from the following self-evident equality: 


Nc < z >c +Nt < z >T= N < z >, 


we conclude that 


rNc Nt 

< z >r - < Z > =< z >T < Z >c < Z >r}, 

r 

= Z >T - < Z >c}. 


Combining Eqs. 69 and 72, we get: 


NNf 1 /o AT; . 

p = (nC- - 2 ) ^ z >T - < Z >c} 


= ( 


(Ai - 1 )/Vc1t2 ' N 
NtN, 

N{N - 1) 


)^/^{< Z>T - <Z >c}M 


( 68 ) 

(69) 

(70) 

(71) 

(72) 

(73) 

(74) 

(75) 


Eq. 36 is simply the vector form of the above expression. 


To obtain the normalized variance equation, we invert Eq. 36 to express w* in terms of p*, and 
substitute into Eq. |26b 


1 


1 






NtN, 


= (Ta 


Nt Nc NtNc ^ ^ 


1/2 A 51 * ) , 


(76) 

(77) 


where, in the last step, we have used the matrix inversion lemma, (AB) ^ = B ^ A Using the 
definition of A in Eq. 31, the above expression can be easily turned into Eq. |27b 


To derive the normalized bias expression of Eq. 27a, we must express its constituent entities, 
i.e. p — q, Z°’*w, Z°’*l and Z°’*Z*, in normalized forms. (We already have an expression for ■u® in 
Eq. [^) The reader can easily verify that: 

1/2 


p - q = -Nc p* + 
Z°’*l = Np° 
Z°’*w = ivJ P° 4 


|^ WAe(A-l) y 




(A-l)Ae y/^ 1/2 
NNt J ^ 


(78) 

(79) 

(80) 


Z° 4 z® = (AT _ 1 ) + /V p° p' 


4/2 


,o . ,i,t 


(81) 
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From the above, we obtain the three additive components of bias (before multiplying 7 °): 

ffi • b ; - * (r • t ) 0 *•“*-‘•'1 - 


( 82 ) 




1/2 






(83) 


z°’* Z* = - 


/lV(iV^y/2 

V NtN, J 


A'l p* 


o *-i j 

pV*’*5]* ' a V 


A*A,(A-l)y - 

Close examination of the above three equations reveals that all the terms involving fi° cancel each 
other out, leading us to Eq. |27a 


E. Invariance of bias estimation under matching using absoiute maximization 


Before matching, the direction of Z° 7 ° with maximum absolute bias is g, and TE bias is given by 
g^(\/]Vg/||g||), where g is the top row of matrix U defined as 


U = (X*’*X*)-^ X*'*. 


(85) 


In other words, g*g is the top-left element of the matrix UU*. After matching, we do not re¬ 
normalize Z°')'° [explain why, here or better in main text]. Without loss of generality, assume that 
matching subselects a contiguous subset (Xi) of Xh 


X* = 


Xi 

X 2 


Inserting Eq. 86 into 85, we obtain: 


U = (X*’*X*)-^ 


Therefore, 


U* = 


Xi 


Xl X* 


(X*’*X*)“\ 


X 2 

Xi (X*’*X*)-1 
X 2 (X*’*X*)-1 


( 86 ) 

(87) 

( 88 ) 

(89) 


ut 


X2 (X*’*X*)-i 


( 90 ) 





















where U* is the matched subset of U*: 
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U* = (91) 

After matching, bias is given by the top-left element of Ui U*, where Ui is given by: 

Ui = (X*Xi)-ix'i. (92) 

Combining the definitions of U* and Ui we get: 

UiU* = (X*i Xi)-^X*i Xi(X*’*X*)-\ (93) 

= (X*’‘X*)-^ (94) 

We see that UiU* is independent of matched subset, thus remaining constant as long as X^ Xi is 

full rank. We also observe that maximum normalized squared bias can be written as: 


xn 2 ^ 

'^max ' 2 ■ 

*^0 


a 


(95) 


F. Matching with repiacement and TE variance 


Her we prove that, in a balanced data set, TE variance is larger than the lower-bound established 
in Theorem if some observations are repeated. In sampling (matching) with replacement, the 
covariance matrix cov[y] is not diagonal, and the expression for covariance matrix of coefficients 
becomes more involved: 


with 


cov[/3] = cov [(X*’*X*)-^X*’V] = D ’^'D^ 

D = (X*’*X*)"^X*’^ 

= cov[y]. 

To obtain TE variance (cr^), we need the top-left element of cov[;9]: 


(96) 

(97) 

(98) 


N 


a = cov 


= (D S D*)i,i = Di,„ (S D*)„, 

n=l 
N N 


l,n' 


(99) 

n=l n'=l 

The above equation indicates that we only need to focus on the first row of D to calculate . Erom 
the definition of D in Equation 97 and X* in Equation]^ and using the special form of (X*’*X*)“^ 
in Equation we conclude that 


D 


/ 

V 


l/Nt + l/Nc -l/Nc 0 
-l/Nc . 

0 . 


\ 


/ \ 


w 

1 * 

Z( 


/ 


V / 


(i/w + i/w)w‘ - ivw 


\ 


( 100 ) 
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Examining the above form reveals the following structure for the hrst row of D: 


Therefore, 


D 


l,n — 


l/Nt, if n G T, 
-1/iVc, if n G C. 


Dl,nDl,n' 


1/A^j^, if n, n' G T, 

< l / A^c ) if ^ C *) 

-l/{NtNc), if n G r,re' G C. 


Combining the above with Equation |99[ we obtain: 


( 101 ) 


( 102 ) 


= Y^ Dl,nDl,n'^n,n' (103) 

n n,n'^n 

To simplify the above sum, we recall the structure of T: its diagonal elements are all equal to Ug, 
and its off-diagonal elements are 0, if observations n and n' are not samples from the same original 
data point, and Ug otherwise. We therefore conclude: 

= al(Yl Kn + E + EE 

\n&T nGC / n&T,n'&T,n'^n 

T 'y ^ y ^ Dl,nDl,n'^n,n' 

n£C,n'£C,n'^n 

Defining the indicator function r[.] such that: 


r[n] = m <^=i> data point n is a sample from the original data point m. 


(104) 


allows us to characterize the entries of S as follows: 


^n,n' — 


(T^, if r[n] = r[n'], 
0, otherwise. 


(105) 


Putting it all together, we get the following expression for TE variance under sampling with re¬ 
placement: 


a 


2 










(Jn 


EE 


Nf 


(Jn 


K 


EE 

ndC ,n'aC ,n' 


1 


1 




/[r[n] == r[n'\\ 
I\r[ri\ == r[n']] 


(106) 

(107) 


This proves that the TE variance lower bound applies to matching with replacement as well. (/[.] 
is the indicator function which returns 1 if argument is true, and 0 otherwise.) 



